esri / geoportal-server-harvester
Metadata Harvester for Esri Geoportal Server
Home Page: http://esri.github.io/geoportal-server/
License: Apache License 2.0
I have instances of Geoportal Harvester and GeoNetwork, both running on Tomcat on my local machine. I am trying to harvest the GeoNetwork data with Harvester. I created a CSW broker, pointed it to my GeoNetwork instance, set up a task... but it will not run. Checking the log, I get this error:
30-Jul-2018 15:44:35.656 SEVERE [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1 Error harvesting of PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=http://localhost:8081/geonetwork/srv/eng/csw, cred-username=admin, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:GeoNetwork], DESTINATIONS: [FOLDER[folder-root-folder=C:\Workspace\G.Young\HubProject\GeoPortal\Harvested, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:165)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:150)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.http.client.HttpResponseException: Unauthorized
at com.esri.geoportal.commons.csw.client.impl.Client.findRecords(Client.java:127)
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:141)
... 2 more
I am able to successfully harvest a CSW from an external node on the web. Does this error have something to do with working with localhost?
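The "Unauthorized" comes back from the CSW endpoint itself, so the credentials (or how they are sent) are the first thing to check, rather than localhost. A minimal sketch, assuming the endpoint expects HTTP Basic auth, builds the same Authorization header the harvester would send so the credentials can be verified independently, e.g. with curl against a GetCapabilities request; the URL and credentials below are placeholders from the issue:

```java
import java.util.Base64;

/** Sketch: build the HTTP Basic auth header a client would send to a CSW
 *  endpoint, so the credentials can be tested outside the harvester.
 *  The endpoint URL and credentials are placeholders from the issue. */
public class CswAuthCheck {
    public static String basicAuth(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes());
        return "Basic " + token;
    }

    public static void main(String[] args) {
        String header = basicAuth("admin", "secret"); // placeholder credentials
        // Paste this into: curl -H "Authorization: <header>" \
        //   "http://localhost:8081/geonetwork/srv/eng/csw?service=CSW&request=GetCapabilities"
        System.out.println("Authorization: " + header);
    }
}
```

If the curl request is also rejected, the problem is the GeoNetwork account or its CSW permissions, not the harvester.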
Feature:
Allow for the creation/linking of a catalog/source record to a harvest
Set up a link to that record
USGS ScienceBase is a large collection.
Tried twice; crashed at 32k and then at 245k records out of 6000k.
Need new techniques for large collections.
https://my.usgs.gov/confluence/display/sciencebase/Catalog+Services
Moving this issue from the catalog repo to here:
Esri/geoportal-server-catalog#67
I'm trying to set a schedule for harvest jobs using harvester release 2.5.1. I've set up several tasks and I want to schedule them. I'm working with a new VM, CentOS 7.3, with geoportal 2.5.0 running on the same machine. SELinux is set to permissive. Harvesting works when I run the tasks (the input brokers are WAF, the output is the localhost geoportal).
When I click on 'schedule', a little modal dialog titled 'Task' opens, with a 'Type' combo box and a 'Submit' button. The menu in the 'Type' combo box is empty, and clicking Submit does nothing.
In the hrv.yyyy-mm-dd.log file, I see this message that appears to be related to the scheduling attempt:
... SEVERE [http-nio-8080-exec-1] org.apache.catalina.core.StandardWrapperValve.invoke Servlet.service() for servlet [spring] in context with path [/harvester] threw exception [Request processing failed; nested exception is java.util.MissingResourceException: Can't find bundle for base name EngineResource, locale en_US] with root cause
...
at com.esri.geoportal.harvester.engine.triggers.AtTrigger.getTemplate(AtTrigger.java:95)
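The trace bottoms out in ResourceBundle.getBundle failing for the base name EngineResource, which happens whenever no EngineResource.properties (or a locale variant such as EngineResource_en_US.properties) is visible on the webapp classpath. A minimal reproduction of that exception path, independent of the harvester:

```java
import java.util.MissingResourceException;
import java.util.ResourceBundle;

/** Minimal reproduction of the MissingResourceException in the trace above:
 *  ResourceBundle.getBundle throws when no EngineResource.properties (or a
 *  locale variant) is on the classpath. The bundle name matches the log
 *  message; how the harvester loads it exactly is an assumption. */
public class BundleCheck {
    public static boolean bundlePresent(String baseName) {
        try {
            ResourceBundle.getBundle(baseName);
            return true;
        } catch (MissingResourceException e) {
            return false; // this is the failure the servlet log reports
        }
    }

    public static void main(String[] args) {
        System.out.println("EngineResource present: " + bundlePresent("EngineResource"));
    }
}
```

So the fix direction is to make sure the deployed WAR actually contains that properties file under WEB-INF/classes (or on the shared classpath).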
I'm trying to harvest ISO XML from a web accessible folder, with gpt 2.? and harvester 2.5. The XML files from the WAF seem to be getting to the harvester, but there's some problem when it gets to a JSON mapping step; the cause is "No content to map due to end-of-input". The log files aren't providing much to go on. I set logging.properties to 'ALL' for the harvester, but I'm still only getting errors and info.
Help!
log extracts from hrv log
17-Apr-2017 16:52:18.260 INFO [https-jsse-nio-8443-exec-4] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor.createProcess SUBMITTING: PROCESSOR: DEFAULT[], SOURCE: WAF[waf-host-url=http://get.iedadata.org/metadata/iso/usap/, waf-pattern=, cred-username=, cred-password=], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=, gpt-index=, gpt-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true
then... (some entries omitted)
com.esri.geoportal.harvester.api.ex.DataOutputException: Error publishing data: id: http://get.iedadata.org/metadata/iso/usap/600025iso.xml, modified: Wed Apr 05 15:43:19 EDT 2017, source URI: http://get.iedadata.org/metadata/iso/usap/600025iso.xml, broker URI: WAF:http://get.iedadata.org/metadata/iso/usap/
at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:164)
at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64)
.... skip some 'at's
at java.lang.Thread.run(Thread.java:745)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]
at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:216)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3833)
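Jackson's "No content to map due to end-of-input" means the mapper was handed an empty response body: the geoportal endpoint replied with no content at all (often a sign of a redirect, auth page, or wrong URL). As a generic illustration (not the harvester's actual GptBroker code), a defensive check before the mapping step makes that failure mode explicit:

```java
/** Sketch of guarding a JSON mapping step against an empty HTTP response
 *  body, which is exactly what "No content to map due to end-of-input"
 *  indicates. Generic illustration, not the harvester's GptBroker code. */
public class EmptyBodyGuard {
    public static String describe(String body) {
        if (body == null || body.trim().isEmpty()) {
            return "empty response body; check endpoint URL, auth, and redirects";
        }
        return "body has " + body.length() + " chars, safe to hand to the JSON mapper";
    }

    public static void main(String[] args) {
        System.out.println(describe(""));                    // the failing case
        System.out.println(describe("{\"status\":\"ok\"}")); // the healthy case
    }
}
```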
Currently, the ArcGIS Portal input broker requires providing an identity, and the harvest only retrieves documents owned by that identity.
The desired state is that, without providing an identity, the harvester connects anonymously and retrieves only the publicly shared items from the indicated portal/Online subscription.
What is the best way to debug the cause of failed harvest entries?
After evaluating the hrv.log I get a count of the number succeeded and the number failed.
Failed entries look like this:
Nov 03, 2016 3:52:41 PM com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess lambda$null$0
WARNING: Failed harvesting REF \\UNCPATH.xml | Mon Aug 15 09:15:28 MDT 2016 | \\UNCPATH.xml during UNC [\\UNCPATH] --> [GPT [http://server:8088/geoportal/]]
Nov 03, 2016 3:52:41 PM com.esri.geoportal.harvester.support.ReportLogger error
SEVERE: Error processing task: PROCESS:: status: working, title: UNC [\\UNCPATH] --> [GPT [http://server:8088/geoportal/]]
com.esri.geoportal.harvester.api.ex.DataOutputException: Error publishing data.
at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:141)
at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64)
at com.esri.geoportal.harvester.api.base.SimpleLink.push(SimpleLink.java:71)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0(DefaultProcessor.java:135)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:133)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Bad Request
at com.esri.geoportal.commons.gpt.client.Client.publish(Client.java:131)
at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:131)
... 7 more
inside the geoportal.log file I see the entries:
2016-11-03 15:52:34,293 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,316 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,332 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,349 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,471 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,877 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,913 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,939 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:35,091 DEBUG [com.esri.geoportal.base.util.DateUtil] - Bad ISO date: REQUIRED: The year (and optionally month, or month and day) for which the data set corresponds to the ground.
2016-11-03 15:52:35,093 DEBUG [com.esri.geoportal.base.util.DateUtil] - Bad ISO date: REQUIRED: The year (and optionally month, or month and day) for which the data set corresponds to the ground.
2016-11-03 15:52:35,527 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
Basically, tons of "Unrecognized metadata type" and "Validation exception" errors. For the validation errors, how would I know what is failing? The "Bad ISO date" error is great, as it clearly tells me what's going wrong.
Thanks!
This wasn't an issue in previous versions, but the latest code seems to have problems harvesting subfolders.
I have tried to harvest from a ROOT directory through WAF and UNC, and each time it appears to only harvest the contents directly inside the ROOT directory; it does not go through the rest of the nested folders.
The only thing I can think of is that there are a few folders my account does not have access rights to; the harvester could be trying those folders and failing. But 95% of the folders are accessible.
Nov 09, 2016 5:18:57 PM com.esri.geoportal.harvester.support.ReportLogger completed
INFO: Completed processing task: PROCESS:: status: completed, title: WAF [http://HOSTNAME/G] --> [GPT [http://HOSTNAME:8088/geoportal/]]
Nov 09, 2016 5:18:57 PM com.esri.geoportal.harvester.support.ReportStatistics completed
INFO: Harvesting of PROCESS:: status: completed, title: WAF [http://HOSTNAME/G] --> [GPT [http://HOSTNAME:8088/geoportal/]] completed at Wed Nov 09 17:18:57 MST 2016. No. succeded: 0, no. failed: 2
INFO: Harvesting of PROCESS:: status: working, title: UNC [\\GIS] --> [GPT [http://HOSTNAME:8088/geoportal/]] started at Wed Nov 09 17:17:46 MST 2016
Nov 09, 2016 5:17:46 PM com.esri.geoportal.harvester.support.ReportLogger completed
INFO: Completed processing task: PROCESS:: status: completed, title: UNC [\\GIS] --> [GPT [http://cal8783:8088/geoportal/]]
Nov 09, 2016 5:17:46 PM com.esri.geoportal.harvester.support.ReportStatistics completed
INFO: Harvesting of PROCESS:: status: completed, title: UNC [\\GIS] --> [GPT [http://cal8783:8088/geoportal/]] completed at Wed Nov 09 17:17:46 MST 2016. No. succeded: 0, no. failed: 0
Hi,
You have a downloadable .war file for the catalog app; can you please provide one for the harvester, for those of us unable to compile our own?
Thanks,
Marc
We have an option for a daily task; however, there's no info on what time it takes place.
I would like to be able to set the exact time of the daily task. Is there a way to go about doing this?
Some suggest a self-signed certificate is the cause, but this one is signed by RapidSSL.
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:189)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$11(DefaultProcessor.java:150)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$172/17671274.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
...
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:145)
Fails:
https://mercury.ornl.gov/oai/provider?verb=ListIdentifiers&metadataPrefix=fgdc&set=
Works:
https://mercury.ornl.gov/oai/provider?verb=ListIdentifiers&metadataPrefix=fgdc
If the set is empty, do not send the set parameter.
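The suggested fix, omitting the set parameter when it is empty, can be sketched as follows; the URL-building helper is illustrative, not the harvester's OAI client code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch of building an OAI-PMH ListIdentifiers URL that omits the set
 *  parameter when no set is configured, per the observation above. The base
 *  URL is the one from the issue; the parameter handling is illustrative. */
public class OaiUrlBuilder {
    public static String listIdentifiers(String base, String prefix, String set) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", "ListIdentifiers");
        params.put("metadataPrefix", prefix);
        if (set != null && !set.isEmpty()) {
            params.put("set", set); // only sent when non-empty
        }
        return base + "?" + params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(listIdentifiers("https://mercury.ornl.gov/oai/provider", "fgdc", ""));
        System.out.println(listIdentifiers("https://mercury.ornl.gov/oai/provider", "fgdc", "mySet"));
    }
}
```

The same guard applies to any optional OAI-PMH parameter, such as from or until.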
The folder output puts items into per-host subfolders:
if I specify /opt/tomcat/webapps/metadata,
then for a run against a service the items are put into
/opt/tomcat/webapps/metadata/opentopo.sdsc.edu/,
which is the base name of the host.
If I set "perform cleanup", it zaps all items in metadata, not just the host folder.
Option 1: only clean up the folder where items will be placed.
Option 2: use the input broker name.
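Option 1 can be sketched as follows; the per-host layout and the host name come from the example above, while the deletion logic is illustrative rather than the folder connector's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Sketch of option 1 above: when cleanup is enabled, delete only the
 *  per-host subfolder this run writes into (e.g. <root>/opentopo.sdsc.edu),
 *  never the whole root. Illustrative, not the harvester's folder code. */
public class ScopedCleanup {
    public static void cleanHostFolder(Path root, String host) {
        Path target = root.resolve(host); // scope deletion to this host only
        if (!Files.exists(target)) return;
        try (Stream<Path> walk = Files.walk(target)) {
            walk.sorted(Comparator.reverseOrder()) // children before parents
                .forEach(p -> {
                    try { Files.delete(p); }
                    catch (IOException e) { throw new UncheckedIOException(e); }
                });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a temp layout, cleans one host folder, reports whether the sibling survived. */
    public static boolean siblingSurvivesCleanup() {
        try {
            Path root = Files.createTempDirectory("metadata");
            Files.createDirectories(root.resolve("opentopo.sdsc.edu"));
            Files.createDirectories(root.resolve("other.host"));
            cleanHostFolder(root, "opentopo.sdsc.edu");
            return !Files.exists(root.resolve("opentopo.sdsc.edu"))
                    && Files.exists(root.resolve("other.host"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sibling survives cleanup: " + siblingSurvivesCleanup());
    }
}
```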
I got a hard Tomcat memory crash after loading 220k files from a single source.
Elasticsearch was still going fine.
Trying to harvest a folder path through UNC
(c:\harvest and \\hostname\harvest),
I get a validation error on every file (thousands).
The same thing happens if I try it through WAF.
These files upload without issue through Geoportal and give no validation problems. I've pulled the latest commit and compiled it.
From the logs:
17-Jul-2017 16:50:22.574 WARNING [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0 Failed harvesting id: C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada\Regional\Technical_Graphics\T00603.ai.xml, modified: Mon Jul 17 11:35:58 MDT 2017, source URI: file:///C:/Harvest/calna1/gisdata_/GIS/GIS_Services/Canada/Regional/Technical_Graphics/T00603.ai.xml, broker URI: UNC:Harvest/calna1/gisdata_/GIS/GIS_Services/Canada during PROCESSOR: DEFAULT[], SOURCE: UNC[unc-root-folder=C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada, unc-pattern=], DESTINATIONS: [GPT[gpt-host-url=http://calwgist08:8080/gp2, cred-username=gptadmin, cred-password=*****, gpt-index=, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false
17-Jul-2017 16:50:22.574 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: UNC[unc-root-folder=C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada, unc-pattern=], DESTINATIONS: [GPT[gpt-host-url=http://calwgist08:8080/gp2, cred-username=gptadmin, cred-password=*****, gpt-index=, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false | Validation exception.
com.esri.geoportal.harvester.gpt.GptBroker$2: Validation exception.
at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:175)
at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64)
at com.esri.geoportal.harvester.api.base.SimpleLink.push(SimpleLink.java:71)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0(DefaultProcessor.java:140)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:138)
at java.lang.Thread.run(Thread.java:748)
Geoportal.log
2017-07-17 17:19:53,253 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
2017-07-17 17:19:53,269 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
2017-07-17 17:19:53,284 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
2017-07-17 17:19:53,300 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
2017-07-17 17:19:53,316 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.
I have tried to:
- Delete the index
- Restart Elasticsearch
- Restart Tomcat
- Restart the server
- Check permissions on the folder
- Recompile the harvester
- Delete the harvester.db files
Things that are working:
- Harvesting into a local folder
Collecting data from DLESE
http://uc.dls.ucar.edu/dds_oai_server/services/oaiDataProvider/oai_explorer.jsp
This says 7,316 records.
http://uc.dls.ucar.edu/oai?verb=ListIdentifiers&metadataPrefix=gmd
The server count is now at 300k, even though the entire site only has 36k records.
http://uc.dls.ucar.edu/oai?verb=ListRecords&metadataPrefix=oai_dc
Harvesting runs at 6 records a minute.
Returning 10 records at a time comes back in about 3 seconds (tested at start positions 1 and 1001).
I saw this before with
http://search.geothermaldata.org/csw
and attributed it to some server/pyCSW issue.
Now it's feeling like it might be something on the harvester side.
IOOS had an issue and has now fixed it (it turned out to be a Python 3 string issue).
This may be a newbie question, but I can't find documentation on how to harvest map services. We have a number of map services hosted on a local server with ArcGIS REST services that I would like to harvest. What is the best way to go about doing this? Is this feature even implemented yet?
Pulled the latest master.
Task > Run is not working.
I have scheduled some tasks.
It would be great to have the same scheduling functionality as the previous version, as well as the ability to have only new/updated metadata harvested.
Does the latest pull of harvester work with the latest Geoportal?
Ability to harvest from OAI and DataONE endpoints is preliminarily implemented.
https://github.com/CINERGI/geoportal-server-harvester/tree/dataone
Will require changes to the catalog server to accept these data formats.
It would be useful to allow specification of a CQL or OGC Filter query to filter a subset of records for harvesting from a CSW broker. For instance, in my case I'd like to harvest only records from the GCMD catalog at https://cmr.earthdata.nasa.gov/csw/collections with the filter &CONSTRAINTLANGUAGE=CQL_TEXT&constraint=AnyText=NSF?PLR
I have certain subfolders that need to be ignored, such as "confidential" and "Archive". Can this be done with the robots.txt file? I have tried a number of combinations that would normally work with a website, to no avail. Please advise on the purpose of robots.txt with the harvester and on how I could implement this. (Tried with both WAF and UNC.)
User-agent: *
Disallow: /Archive/
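For reference, robots.txt Disallow rules are prefix matches against the URL path; and since robots.txt is an HTTP convention, it can plausibly apply to WAF sources, while a UNC share has nothing to fetch it from (that last point is an assumption about the harvester's behavior, not a confirmed fact). A minimal sketch of the matching:

```java
import java.util.List;

/** Sketch of the prefix matching robots.txt Disallow rules imply: a path is
 *  blocked when it starts with any disallowed prefix. Illustrative only; the
 *  harvester's actual robots.txt handling may differ, and robots.txt applies
 *  to HTTP (WAF) sources, while a UNC share has no robots.txt to fetch. */
public class RobotsMatcher {
    private final List<String> disallow;

    public RobotsMatcher(List<String> disallow) {
        this.disallow = disallow;
    }

    public boolean allowed(String path) {
        return disallow.stream().noneMatch(path::startsWith);
    }

    public static void main(String[] args) {
        RobotsMatcher m = new RobotsMatcher(List.of("/Archive/", "/confidential/"));
        System.out.println(m.allowed("/Archive/old.xml"));  // blocked
        System.out.println(m.allowed("/data/current.xml")); // allowed
    }
}
```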
Allow for the setting of a contributor in the harvest source
I now understand why I was having great difficulties at Esri/geoportal-server-catalog#42
I then brought my old index back from a backup and the previous Geoportal worked again.
Basically, it seems the new Harvester saves to the Elasticsearch [metadata] index instead of [metadata_v1].
This took an entire day to figure out, as I couldn't see anything about it in the logs.
24-Mar-2017 09:54:07.110 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportStatistics.completed Harvesting of PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CKAN[ckan-host-url=http://demo.ckan.org/, ckan-apikey=], DESTINATIONS: [FOLDER[folder-root-folder=D:\dev_odm\geoportal_metadata, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true completed at Fri Mar 24 09:54:07 PDT 2017. No. succeded: 0, no. failed: 0
Exception in thread "HARVESTING" java.lang.RuntimeException: javax.xml.stream.XMLStreamException: Unbound namespace URI 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:391)
at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:192)
at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:381)
at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:187)
at com.esri.geoportal.commons.meta.js.BaseJSMetaBuilder.execute(BaseJSMetaBuilder.java:67)
at com.esri.geoportal.commons.meta.js.BaseJSMetaBuilder.create(BaseJSMetaBuilder.java:53)
at com.esri.geoportal.harvester.ckan.CkanBroker$CkanIterator.next(CkanBroker.java:224)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:137)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$590/4609631.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
Caused by: javax.xml.stream.XMLStreamException: Unbound namespace URI 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
at com.ctc.wstx.sw.SimpleNsStreamWriter.writeStartOrEmpty(SimpleNsStreamWriter.java:231)
at com.ctc.wstx.sw.BaseNsStreamWriter.writeStartElement(BaseNsStreamWriter.java:306)
at com.esri.geoportal.commons.meta.js.XmlBuilder.writeStartElement(XmlBuilder.java:78)
at jdk.nashorn.internal.scripts.Script$Recompilation$58$677A$\^eval\_.create(<eval>:30)
at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:638)
at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:229)
at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:387)
... 9 more
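"Unbound namespace URI" from a StAX writer means writeStartElement was called with a namespace URI that was never bound to a prefix. One standard remedy (illustrative, not necessarily how CkanBroker's XmlBuilder should be changed) is enabling the writer's namespace-repairing mode, which makes the writer bind prefixes itself:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** "Unbound namespace URI" means the StAX writer was handed a namespace URI
 *  with no prefix bound for it. With IS_REPAIRING_NAMESPACES enabled, the
 *  writer generates and declares prefixes itself, so the same write call
 *  succeeds. Illustrative fix, not the harvester's code. */
public class NamespaceRepairDemo {
    public static String writeRdfRoot() {
        String rdfNs = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
        try {
            XMLOutputFactory factory = XMLOutputFactory.newInstance();
            factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = factory.createXMLStreamWriter(out);
            writer.writeStartElement(rdfNs, "RDF"); // throws "Unbound namespace URI" without repairing mode
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeRdfRoot());
    }
}
```

The alternative is to bind the prefix explicitly (setPrefix/writeNamespace) before writing the element.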
Minor improvement to the UI
What language is used when specifying patterns? For example, I want to harvest a folder but I don't want it to harvest a specific subfolder called "Archive"; what pattern would be required?
This would be useful info for a future wiki.
Our organization has ArcGIS REST services at domain/arcgis/rest.
It automatically authenticates with our Windows AD credentials when navigating to the page.
When I try to harvest Map services I am getting the following error:
08-Aug-2017 09:31:23.880 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: AGS[ags-host-url=http://ngismap/arcgis/rest/services, cred-username=DOMAIN\\username, cred-password=*****, ags-enable-layers=false], DESTINATIONS: [GPT[gpt-host-url=http://GEOPORTAL:8080, cred-username=gptadmin, cred-password=*****, gpt-index=metadata_v1, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false | Error listing server content. com.esri.geoportal.harvester.api.ex.DataInputException: Error listing server content. at com.esri.geoportal.harvester.ags.AgsBroker.iterator(AgsBroker.java:135) at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:131) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.http.client.HttpResponseException: Unauthorized at com.esri.geoportal.commons.ags.client.AgsClient.listContent(AgsClient.java:118) at com.esri.geoportal.harvester.ags.AgsBroker.listResponses(AgsBroker.java:152) at com.esri.geoportal.harvester.ags.AgsBroker.iterator(AgsBroker.java:132) ... 2 more
I've set up the authentication in Harvester as:
Username: DOMAIN\username (also tried DOMAIN\\username)
Password: ****
I've tried editing the Map Services broker, saving it, then deleting the task and re-adding it, then running the task.
This looks like a bug, unless I'm missing something here. The username/password are 100% correct. Thanks.
Automating the harvesting process would be great. Would it be possible to connect the harvester to a script that detects changes in the folder?
For example, if a new XML file is added to a harvested folder, it would automatically be detected and harvesting would begin.
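Outside the harvester itself, the JDK's WatchService provides exactly this kind of detection; a sketch follows (the reaction to the event is left as a comment, since wiring it to the harvester, e.g. through its REST API, is an assumption here):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

/** Sketch of the watcher described above, using the JDK WatchService: register
 *  a folder for ENTRY_CREATE events and react when a file appears. A real
 *  script would loop forever and trigger the harvest task on each event;
 *  how to trigger the harvester is an assumption left as a comment. */
public class FolderWatcher {
    /** Registers a temp folder, drops a file in, and reports whether the
     *  creation event was observed (a self-contained demonstration). */
    public static boolean demoDetectsCreate() {
        try {
            Path dir = Files.createTempDirectory("watched");
            try (WatchService watcher = dir.getFileSystem().newWatchService()) {
                dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                Files.createFile(dir.resolve("new.xml"));          // simulate a drop-off
                WatchKey key = watcher.poll(10, TimeUnit.SECONDS); // wait for the event
                if (key == null) return false;
                for (WatchEvent<?> event : key.pollEvents()) {
                    // Here a real watcher would kick off the harvest task.
                    if ("new.xml".equals(event.context().toString())) return true;
                }
                return false;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("create event detected: " + demoDetectsCreate());
    }
}
```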
I'm trying to set up a broker to grab items within our ArcGIS Online org. We have ArcGIS Online federated with our Active Directory. I'm giving my broker the root of our AGOL org, http://austin.maps.arcgis.com, and then putting in my federated email and password (at the moment I'm just testing, so my own login is okay). The task fails when using this broker.
My goal is to harvest just the items that are in our org AND have been shared with the entire org. Not sure if that filtering is possible with the harvester yet.
Thank you,
Pondering how to handle this in the codebase:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
A watched session will not trigger the rate limit.
The implementation is not quite correct, so 503 might also have to be trapped:
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#FlowControlAndLoadBalancing
endpoint:
https://ws.pangaea.de/oai/provider
At about 60 records, we get a Too Many Requests (HTTP 429).
19-Oct-2018 13:05:04.634 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: NAME: , PROCESSOR: DEFAULT[], SOURCE: OAI-PMH[oai-host-url=https://ws.pangaea.de/oai/provider, oai-prefix=iso19139, oai-set=], DESTINATIONS: [FOLDER-SPLIT[folder-root-folder=d:\metadata\, folder-split-folders=true, folder-split-size=1000, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:147)
at com.esri.geoportal.harvester.oai.pmh.OaiBroker.access$200(OaiBroker.java:56)
at com.esri.geoportal.harvester.oai.pmh.OaiBroker$OaiIterator.next(OaiBroker.java:206)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$25(DefaultProcessor.java:154)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$129/22196721.run(Unknown Source)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Too Many Requests
at com.esri.geoportal.commons.oai.client.Client.readRecord(Client.java:154)
at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:140)
... 5 more
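The flow-control handling discussed above can be sketched as a retry loop: retry on 429 (and, per the OAI-PMH guidelines note, on 503 as well), backing off between attempts. The fetch is abstracted as a status-code supplier; a real client would also honor the Retry-After header, which this simplified sketch does not:

```java
import java.util.function.IntSupplier;

/** Sketch of retrying on 429/503 with linear backoff, per the flow-control
 *  discussion above. The fetch is abstracted as an IntSupplier returning a
 *  status code; honoring Retry-After is omitted for brevity. */
public class RateLimitRetry {
    public static int fetchWithRetry(IntSupplier fetch, int maxAttempts, long backoffMillis) {
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = fetch.getAsInt();
            if (status != 429 && status != 503) return status; // success or hard error
            try {
                Thread.sleep(backoffMillis * attempt);         // linear backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status; // still rate-limited after maxAttempts
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated endpoint: two 429 responses, then 200.
        IntSupplier flaky = () -> (++calls[0] <= 2) ? 429 : 200;
        System.out.println("final status: " + fetchWithRetry(flaky, 5, 100));
        System.out.println("attempts: " + calls[0]);
    }
}
```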
Source name in com.esri.geoportal.commons.gpt.client.PublishRequest:
public final class PublishRequest {
  public String src_source_type_s;
  public String src_source_uri_s;
  public String src_uri_s;
  public String src_lastupdate_dt;
  public String xml;
}
I got the v2.5.1 release war and deployed it to Tomcat 8.0.48 on an Amazon VM running CentOS 7.3.
The app starts OK and the web page opens at http://52.54.48.218:8080/harvester/. When I try to create a broker (input or output), the dialog opens and I fill it out; when I click Submit, nothing happens, and a red text message 'error creating broker' appears in the title bar.
In the hrv.2018-01-21.log file, I see:
INFO [http-nio-8080-exec-27] com.esri.geoportal.harvester.beans.BrokerDefinitionManagerBean.init Error initializing broker definition database
org.h2.jdbc.JdbcSQLException: IO Exception: null [90028-196]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:168)
...
Caused by: java.lang.IllegalStateException: Could not open file nio:/opt/tomcat/harvester.mv.db [1.4.196/1]
and similar messages for the 'trigger definition database' and 'task database' when the harvester context is loading. I don't see any /opt/tomcat/harvester.mv.db file, and I never had this problem with my other harvester deployment on a CentOS 7.3 VM running at Linode.
Then, when I try to save the new broker, I get 'Error creating broker definition', again with:
Caused by: java.lang.IllegalStateException: Could not open file nio:/opt/tomcat/harvester.mv.db [1.4.196/1]
Any thoughts on what is going on?
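The H2 message means the JVM could not create /opt/tomcat/harvester.mv.db; on CentOS that is most often a file-permission problem for the user Tomcat runs as, possibly compounded by SELinux contexts or systemd sandboxing (an assumption worth checking first). A tiny probe, run as the same user as Tomcat, shows whether that directory is writable at all; the path is taken from the log line above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Probe whether the current user can create a file in a directory, which is
 *  exactly what H2 fails to do for /opt/tomcat/harvester.mv.db above. On
 *  CentOS also consider SELinux contexts and systemd ReadWritePaths. */
public class DbPathCheck {
    public static boolean canCreateFileIn(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "h2probe", ".tmp");
            Files.delete(probe); // clean up the probe file
            return true;
        } catch (IOException | SecurityException e) {
            return false; // missing directory or insufficient permissions
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/opt/tomcat");
        System.out.println(dir + " writable: " + canCreateFileIn(dir));
    }
}
```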
Esri/geoportal-server-catalog#9
Ability to harvest records and store them on disk.
Hello,
I am running into an issue where harvests into AGOL and Portal for ArcGIS have their extents improperly populated. When I manually register a map service, the extent is set correctly. Harvesting, however, seems to set the values in the original meters as listed in the service, instead of converting the numbers to decimal degrees.
Is this a known issue, or is there something I may be missing in terms of configuration?
Thank you
nmp
Using geoportal-server-harvester 2.5.2
As an administrator I want to set up regular harvesting of a set of network file shares and subfolders for metadata files and publish these to a number of different targets.
Dot-dot causing issues (insert Mamma Mia joke here):
https://coastwatch.pfeg.noaa.gov/erddap/metadata/iso19115/xml/
Exception in thread "HARVESTING" java.lang.IllegalArgumentException: Illegal character in fragment at index 69: https://coastwatch.pfeg.noaa.gov/erddap/metadata/iso19115/xml/..
at java.net.URI.create(URI.java:852)
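The trailing ".." suggests the WAF crawler followed a parent-directory link from the HTML directory listing. One way to avoid both the URI failure and crawling out of the tree is to resolve and normalize every href against the listing URL and drop anything that is not strictly below the base; a sketch (illustrative, not the harvester's WAF crawler code):

```java
import java.net.URI;

/** Sketch of filtering WAF directory-listing links: resolve each href against
 *  the listing URL, normalize it, and drop anything not strictly below the
 *  base. Parent links such as ".." produce exactly the kind of URL in the
 *  error above. Illustrative, not the harvester's WAF code. */
public class WafLinkFilter {
    public static boolean withinBase(String baseUrl, String href) {
        URI base = URI.create(baseUrl);
        URI resolved = base.resolve(href).normalize();
        return resolved.toString().startsWith(base.toString())
                && !resolved.equals(base); // the listing itself is not a record
    }

    public static void main(String[] args) {
        String base = "https://coastwatch.pfeg.noaa.gov/erddap/metadata/iso19115/xml/";
        System.out.println(withinBase(base, "someRecord.xml")); // keep
        System.out.println(withinBase(base, ".."));             // drop
        System.out.println(withinBase(base, "../"));            // drop
    }
}
```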
Trying to see if I can harvest from data.gov; failing on all of:
http://catalog.data.gov/csw
https://catalog.data.gov/csw
http://catalog.data.gov/csw-all
https://catalog.data.gov/csw-all
If you add a task, it adds multiple outputs the second and third time.
Hi,
I'm trying to migrate Geoportal 1.2.9 metadata backed by a SQL Server 2012 database to a Geoportal 2.6 instance. I'm pretty sure I have the context.xml configured correctly with the right parameters, because I get all the registration records from the database, but no actual metadata. Is this the correct behavior, and do I have to subsequently use each registration's input broker to finally get all the metadata?
Thanks,
Rick
As an administrator I want to be able to set up a regular process of harvesting content from one or more sources into my ArcGIS Online subscription / Portal for ArcGIS instance.
Currently, harvesting of data.gov content through CKAN only returns a Dublin Core document; full metadata can be extracted as follows:
the data.gov CKAN API returns harvest_object_id in the extras field, and using that value you can get the XML at
/harvest/object/[harvest_object_id]
To tell whether a dataset was harvested from an XML source or a datajson source, look in the extras field for the key 'source_datajson_identifier'. If the value is 'true', then the source is datajson and the harvest object metadata will be in JSON format.
examples:
https://catalog.data.gov/api/3/action/package_show?id=u-s-hourly-precipitation-data
https://catalog.data.gov/harvest/object/ac0da4ab-0b88-48c4-af2b-00611df0d956
https://catalog.data.gov/api/3/action/package_show?id=demographic-statistics-by-zip-code-acfc9
https://catalog.data.gov/harvest/object/810835b2-f684-495a-8c0e-d24b15bd2154
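The lookup described above can be sketched as a small helper; JSON parsing is elided and the extras are represented as a plain map, so this is an illustration of the rule rather than working CKAN client code:

```java
import java.util.Map;

/** Sketch of the lookup described above: given the "extras" key/value pairs
 *  from a CKAN package_show response, return the full-metadata URL at
 *  /harvest/object/[harvest_object_id], or null when the package came from a
 *  datajson source (source_datajson_identifier=true, so the harvest object
 *  is JSON, not XML) or has no harvest object at all. */
public class CkanHarvestObject {
    public static String fullMetadataUrl(String ckanBase, Map<String, String> extras) {
        if ("true".equals(extras.get("source_datajson_identifier"))) {
            return null; // datajson source: harvest object metadata is JSON, not XML
        }
        String id = extras.get("harvest_object_id");
        return id == null ? null : ckanBase + "/harvest/object/" + id;
    }

    public static void main(String[] args) {
        Map<String, String> extras = Map.of("harvest_object_id",
                "ac0da4ab-0b88-48c4-af2b-00611df0d956"); // id from the example above
        System.out.println(fullMetadataUrl("https://catalog.data.gov", extras));
    }
}
```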
Can you please clarify the installation instructions?
After downloading the zip file and unzipping it to C:\temp\geoportal-server-harvester-master I try:
cd c:\temp\geoportal-server-harvester-master
mvn clean install
This runs for 5 seconds and then generates the error message:
[ERROR] Failed to execute goal on project geoportal-harvester-csw: Could not resolve dependencies for project com.esri.geoportal:geoportal-harvester-csw:jar:1.0.0-SNAPSHOT: Could not find artifact com.esri.geoportal:geoportal-commons-http:jar:1.0.0-SNAPSHOT -> [Help 1]
The full error log is here:
install_errors.txt
Am I doing something wrong, or is there a different problem? (This is my first time using Maven.)
I'm using Apache Tomcat 8.0.35 and JDK 1.8.0_92 on Windows 7. Thanks
Hi,
In order to harvest an Enterprise Geodatabase, should I export all the metadata into XML metadata files?
Is it possible to create a connector?
Best regards