esri / geoportal-server-harvester Goto Github PK

View Code? Open in Web Editor NEW

31.0 21.0 24.0 50.79 MB

Metadata Harvester for Esri Geoportal Server

Home Page: http://esri.github.io/geoportal-server/

License: Apache License 2.0

Java 54.84% XSLT 39.16% JavaScript 4.76% HTML 0.73% CSS 0.51%

publishing-sharing geoportal geoportal-server esri-geoportal-server metadata harvester csw ckan waf data-management

geoportal-server-harvester's Introduction

geoportal-server-harvester

As part of the evolution of Geoportal Server, the harvesting capability has been separated into its own module. This is because there are use cases where the harvesting can be used as a stand-alone broker between catalogs of content.

This repository thus contains the harvesting capability, while it's sibling geoportal-server-catalog is the new catalog of Geoportal Server.

For details about geoportal server harvester, please visit the wiki.

To report an issue, please go to issues.

Limiting the Harvester's Reach

The nature of the Harvester application is, as the name suggests, to harvest metadata from whatever web endpoints it is provided. The list(s) of endpoints to download metadata from can also be provided by external entities over the internet. Neither the metadata being harvested nor the list(s) of endpoints provided by external entities are vetted or checked by the Harvester. Users who wish to limit the scope of the Harvester's reach should configure the network or machine where the Harvester is located with allow lists or deny lists of web endpoints to prevent the Harvester from reaching undesirable locations.

Releases and Downloads

2.7.1 - December 21, 2023, release notes.
2.7.0 - June 13, 2021.
2.6.5 - July 13, 2021.
2.6.4 - July 8, 2020.

Features

Please visit Features.

Instructions

Building the source code:

Run 'mvn clean install'

Building javadoc:

Run 'mvn javadoc::aggregate'

Deploying war file:

Deploy 'geoportal-application\geoportal-harvester-war\target\geoportal-harvester-war-<version>.war' into the web server of your choice.
No configuration required.

Requirements

Java JDK 11 (preferred: AdoptOpenJDK 11)
Apache Tomcat 9.x

Contributing

Esri welcomes contributions from anyone and everyone. Please see our guidelines for contributing.

Licensing

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

A copy of the license is available in the repository's LICENSE file.

geoportal-server-harvester's People

Contributors

Stargazers

Watchers

geoportal-server-harvester's Issues

Is this working with current Geoportal Server 2.5.0/ES 5?

Does the latest pull of harvester work with the latest Geoportal?

Support harvesting from UNC source

As an administrator I want to setup a regular harvesting of a set of network file shares and subfolders for metadata files and publish these to a number of different targets.

Add Tasks Shows Input Input

Иркутск

Support harvesting to ArcGIS Online/Portal

As an administrator I want to be able to setup a regular process of harvesting content from one or more sources into my ArcGIS Online subscription/Portal for ArcGIS instance.

Harvest ArcGIS Server into AGOL/Portal extent issues

Hello,
I am running into an issue where the harvest into AGOL and Portal for ArcGIS Server have the extents improperly populated. When I manually register a map service the extent is correctly set. Harvesting however seems to be setting the values in the original meters as listed in the service and not converting the number to Decimal degrees.
Is this a known issue or is there something i may be missing in terms of configuration?

Thank you
nmp

Using geoportal-server-harvester 2.5.2

Geoportal 2.6 Migration Tool only gets registration records

Hi,
I'm trying to migrate Geoportal 1.2.9 metadata backed by a SQL Server 2012 database to a Geoportal 2.6 instance. I'm pretty sure I got the context.xml configured correctly with the right parameters because I get all the registration records from the database, but no actual metadata. Is this correct behavior and I have to subsequently use each registration input broker to finally get all the metadata?

Thanks,

Rick

pycsw data.gov failing

Trying to see if I can harvest from data.gov, failing

http://catalog.data.gov/csw
https://catalog.data.gov/csw
http://catalog.data.gov/csw-all
https://catalog.data.gov/csw-all

Task>Run not working

Pulled the latest master
Task>Run not working

Have scheduled some.

Validation error on every file unc/waf

Trying to harvest a folder path through UNC
c:\harvest and \hostname\harvest
I get a validation error on every file (thousands)

The same thing happens if I try it through WAF.

These files upload without issue through Geoportal and give no validation issues. I've pulled the latest commit and compiled it.

From the logs:
17-Jul-2017 16:50:22.574 WARNING [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0 Failed harvesting id: C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada\Regional\Technical_Graphics\T00603.ai.xml, modified: Mon Jul 17 11:35:58 MDT 2017, source URI: file:///C:/Harvest/calna1/gisdata_/GIS/GIS_Services/Canada/Regional/Technical_Graphics/T00603.ai.xml, broker URI: UNC:Harvest/calna1/gisdata_/GIS/GIS_Services/Canada during PROCESSOR: DEFAULT[], SOURCE: UNC[unc-root-folder=C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada, unc-pattern=], DESTINATIONS: [GPT[gpt-host-url=http://calwgist08:8080/gp2, cred-username=gptadmin, cred-password=*****, gpt-index=, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false 17-Jul-2017 16:50:22.574 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: UNC[unc-root-folder=C:\Harvest\calna1\gisdata_\GIS\GIS_Services\Canada, unc-pattern=], DESTINATIONS: [GPT[gpt-host-url=http://calwgist08:8080/gp2, cred-username=gptadmin, cred-password=*****, gpt-index=, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false | Validation exception. com.esri.geoportal.harvester.gpt.GptBroker$2: Validation exception. at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:175) at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64) at com.esri.geoportal.harvester.api.base.SimpleLink.push(SimpleLink.java:71) at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0(DefaultProcessor.java:140) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:138) at java.lang.Thread.run(Thread.java:748)

Geoportal.log
2017-07-17 17:19:53,253 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception. 2017-07-17 17:19:53,269 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception. 2017-07-17 17:19:53,284 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception. 2017-07-17 17:19:53,300 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception. 2017-07-17 17:19:53,316 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.

I have tried to:
-Delete the index
-Restarted elasticsearch
-Restarted Tomcat
-Restarted Server
-Checked permissions on folder
-Recompiled Harvester
-Deleted the harvester.db files

Things that are working:
-Harvesting into a local folder

Daily schedule - setting the time

We have an option for a daily task, however theres no info on what time this takes place.
I would like to be able to set the exact time of the daily task. Is there a way to go about doing this?

Using robots.txt to ignore subfolders?

I have certain subfolders that need to be ignored such as "confidential" and "Archive". Can this be done with the robots.txt file? I have tried a number of combinations that would normally work with a website to no avail. Please advise on the purpose of robots.txt with harvester and on how I could implement this. (Tried with WAF and UNC)

User-agent: *
Disallow: /Archive/

trouble with deploying harvester--broker definition database???

I got the v2.5.1 release war and deployed it to tomcat 8.0.48 on an Amazon VM CentOS 7.3.
the App starts OK, and web page opens http://52.54.48.218:8080/harvester/, when I try to create a broker (input or output), the dialog opens, I fill it out; when I click submit, nothing happens, and a red text message appears in the title bar 'error creating broker'.
In the hrv.2018-01-21.log file, I see:
INFO [http-nio-8080-exec-27] com.esri.geoportal.harvester.beans.BrokerDefinitionManagerBean.init Error initializing broker definition database org.h2.jdbc.JdbcSQLException: IO Exception: null [90028-196] at org.h2.message.DbException.getJdbcSQLException(DbException.java:345) at org.h2.message.DbException.get(DbException.java:168) ... Caused by: java.lang.IllegalStateException: Could not open file nio:/opt/tomcat/harvester.mv.db [1.4.196/1]
and similar message for 'trigger definition database' and 'task database' when the harvester context is loading. I don't see any /opt/tomcat/harvester.mv.db file, and never had this problem with my other harvester deployment on a CentOS 7.3 VM running at Linode.

Then when I'm trying to save the new broker, the problem 'Error creating broker definition', again
Caused by: java.lang.IllegalStateException: Could not open file nio:/opt/tomcat/harvester.mv.db [1.4.196/1]

Any thoughts on what is going on?

Issue with SSL

Some suggest self signed. This is signed by rapid ssl.

https://data.ioos.us/csw

com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
	at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:189)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$11(DefaultProcessor.java:150)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$172/17671274.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
...
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:145)

Ckan harvest failure to a folder

https://data.noaa.gov/
http://demo.ckan.org/

The URI is registered in the Evaluator.js

24-Mar-2017 09:54:07.110 INFO [HARVESTING] com.esri.geoportal.harvester.support.ReportStatistics.completed Harvesting of PROCESS:: status: completed, title: PROCESSOR: DEFAULT[], SOURCE: CKAN[ckan-host-url=http://demo.ckan.org/, ckan-apikey=], DESTINATIONS: [FOLDER[folder-root-folder=D:\dev_odm\geoportal_metadata, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true completed at Fri Mar 24 09:54:07 PDT 2017. No. succeded: 0, no. failed: 0
Exception in thread "HARVESTING" java.lang.RuntimeException: javax.xml.stream.XMLStreamException: Unbound namespace URI 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
	at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:391)
	at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:192)
	at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:381)
	at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:187)
	at com.esri.geoportal.commons.meta.js.BaseJSMetaBuilder.execute(BaseJSMetaBuilder.java:67)
	at com.esri.geoportal.commons.meta.js.BaseJSMetaBuilder.create(BaseJSMetaBuilder.java:53)
	at com.esri.geoportal.harvester.ckan.CkanBroker$CkanIterator.next(CkanBroker.java:224)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:137)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$590/4609631.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.xml.stream.XMLStreamException: Unbound namespace URI 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
	at com.ctc.wstx.sw.SimpleNsStreamWriter.writeStartOrEmpty(SimpleNsStreamWriter.java:231)
	at com.ctc.wstx.sw.BaseNsStreamWriter.writeStartElement(BaseNsStreamWriter.java:306)
	at com.esri.geoportal.commons.meta.js.XmlBuilder.writeStartElement(XmlBuilder.java:78)
	at jdk.nashorn.internal.scripts.Script$Recompilation$58$677A$\^eval\_.create(<eval>:30)
	at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:638)
	at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:229)
	at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:387)
	... 9 more

OAI-PHM Too many requests

Pondering how to handle this in the codebase:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
A watched session will not trigger the rate limited.

Not quite correct implementation, so 503 might also be trapped:
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#FlowControlAndLoadBalancing

endpoint:
https://ws.pangaea.de/oai/provider

at about 60 records, we get a Too Many Requests.. http 429

19-Oct-2018 13:05:04.634 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: NAME: , PROCESSOR: DEFAULT[], SOURCE: OAI-PMH[oai-host-url=https://ws.pangaea.de/oai/provider, oai-prefix=iso19139, oai-set=], DESTINATIONS: [FOLDER-SPLIT[folder-root-folder=d:\metadata\, folder-split-folders=true, folder-split-size=1000, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true | Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
 com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data from: OAI [https://ws.pangaea.de/oai/provider]
	at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:147)
	at com.esri.geoportal.harvester.oai.pmh.OaiBroker.access$200(OaiBroker.java:56)
	at com.esri.geoportal.harvester.oai.pmh.OaiBroker$OaiIterator.next(OaiBroker.java:206)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$25(DefaultProcessor.java:154)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess$$Lambda$129/22196721.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Too Many Requests
	at com.esri.geoportal.commons.oai.client.Client.readRecord(Client.java:154)
	at com.esri.geoportal.harvester.oai.pmh.OaiBroker.readContent(OaiBroker.java:140)
	... 5 more

Harvest from localhost fails

I have instances of GeoPortal Harvester and Geonetwork both running on Tomcat on my local machine. I am trying to harvest the Geonetwork data with Harvester. I created a CSW broker, pointed it to my geonetwork instance, set up a task... but it will not run. Checking the log, I get this error:

30-Jul-2018 15:44:35.656 SEVERE [HARVESTING] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1 Error harvesting of PROCESSOR: DEFAULT[], SOURCE: CSW[csw-host-url=http://localhost:8081/geonetwork/srv/eng/csw, cred-username=admin, cred-password=*****, csw-profile-id=urn:ogc:CSW:2.0.2:HTTP:APISO:GeoNetwork], DESTINATIONS: [FOLDER[folder-root-folder=C:\Workspace\G.Young\HubProject\GeoPortal\Harvested, folder-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false
com.esri.geoportal.harvester.api.ex.DataInputException: Error reading data.
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:165)
at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:150)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.http.client.HttpResponseException: Unauthorized
at com.esri.geoportal.commons.csw.client.impl.Client.findRecords(Client.java:127)
at com.esri.geoportal.harvester.csw.CswBroker$CswIterator.hasNext(CswBroker.java:141)
... 2 more

I am able to successfully harvest a CSW from an external node on the web. Does this error have something to do with working with localhost?

Harvesting Map Services

This maybe a newbie question but I can't find documentation on how to harvest map services. We have a number of map services hosted on a local server with ArcGIS REST Services that I would like to harvest. What is the best way to go about doing this? Is this feature even implemented yet?

Request: Scheduled harvesting

It would be great to have the same functionality as the previous version in terms of scheduling harvests. As well as the feature to have only new/updated metadata get harvested.

Map Services Authentication Fails

Our organization has Arcgis rest services at domain/arcgis/rest

It automatically authenticates with our Windows AD credentials when navigating to the page.

When I try to harvest Map services I am getting the following error:
08-Aug-2017 09:31:23.880 SEVERE [HARVESTING] com.esri.geoportal.harvester.support.ErrorLogger.logError Error processing task: PROCESS:: status: working, title: PROCESSOR: DEFAULT[], SOURCE: AGS[ags-host-url=http://ngismap/arcgis/rest/services, cred-username=DOMAIN\\username, cred-password=*****, ags-enable-layers=false], DESTINATIONS: [GPT[gpt-host-url=http://GEOPORTAL:8080, cred-username=gptadmin, cred-password=*****, gpt-index=metadata_v1, gpt-cleanup=false, gpt-accept-xml=true, gpt-accept-json=false]], INCREMENTAL: false, IGNOREROBOTSTXT: false | Error listing server content. com.esri.geoportal.harvester.api.ex.DataInputException: Error listing server content. at com.esri.geoportal.harvester.ags.AgsBroker.iterator(AgsBroker.java:135) at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:131) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.http.client.HttpResponseException: Unauthorized at com.esri.geoportal.commons.ags.client.AgsClient.listContent(AgsClient.java:118) at com.esri.geoportal.harvester.ags.AgsBroker.listResponses(AgsBroker.java:152) at com.esri.geoportal.harvester.ags.AgsBroker.iterator(AgsBroker.java:132) ... 2 more

Ive setup the authentication in Harvester as:
Username: DOMAIN\username also DOMAIN\username
Password: ****

I've tried to edit Map services Broker, saved, then deleted the task and readded it. Then ran the task.
This looks like a bug unless I'm missing something here. Username/PW are 100% correct. Thanks.

OAI not stopping

Collecting data from DLESE
http://uc.dls.ucar.edu/dds_oai_server/services/oaiDataProvider/oai_explorer.jsp

this says 7316 records
http://uc.dls.ucar.edu/oai?verb=ListIdentifiers&metadataPrefix=gmd

Server count now at 300k, even though entire site only has 36k records
http://uc.dls.ucar.edu/oai?verb=ListRecords&metadataPrefix=oai_dc

Is the harvester able to handle federated logins to ArcGIS Online?

I'm trying to setup a broker to grab items within our ArcGIS Online org. We have ArcGIS Online federated with our Active Directory. I'm giving my broker the root of our AGOL org http://austin.maps.arcgis.com and then putting in my federated email and password (at the moment I'm just testing, so my own login is okay). The task fails when using this broker.

My goal is to harvest just the items that are in our org AND have been shared with the entire org. Not sure if that filtering is possible with the harvester yet.

Thank you,

Catalog Records

Feature:
Allow for the creation/linking of a catalog/source record to a harvest

Setup a link to that record

GPT 2.0 as Source

Esri/geoportal-server-catalog#9

Ability to harvest records and store on a disk.

Memory issue

I got a hardcore tomcat memory crash after loading 220k files from a single source.

Elastic search was still going fine.

Subfolders not harvesting

This wasn't an issue in previous versions but the latest code seems to have problems harvesting subfolders.

I have tried to harvest from a ROOT directory through WAF and UNC and each time it appears to only harvest the contents directly inside the ROOT directory and does not go through the rest of nested folders.

The only thing I can think of is that there a few folders that my account would not have access rights to, harvester could be trying these folders and failing. But 95% of these folders are accessible.

Nov 09, 2016 5:18:57 PM com.esri.geoportal.harvester.support.ReportLogger completed
INFO: Completed processing task: PROCESS:: status: completed, title: WAF [http://HOSTNAME/G] --> [GPT [http://HOSTNAME:8088/geoportal/]]
Nov 09, 2016 5:18:57 PM com.esri.geoportal.harvester.support.ReportStatistics completed
INFO: Harvesting of PROCESS:: status: completed, title: WAF [http://HOSTNAME/G] --> [GPT [http://HOSTNAME:8088/geoportal/]] completed at Wed Nov 09 17:18:57 MST 2016. No. succeded: 0, no. failed: 2`

INFO: Harvesting of PROCESS:: status: working, title: UNC [\\GIS] --> [GPT [http://HOSTNAME:8088/geoportal/]] started at Wed Nov 09 17:17:46 MST 2016 Nov 09, 2016 5:17:46 PM com.esri.geoportal.harvester.support.ReportLogger completed INFO: Completed processing task: PROCESS:: status: completed, title: UNC [\\GIS] --> [GPT [http://cal8783:8088/geoportal/]] Nov 09, 2016 5:17:46 PM com.esri.geoportal.harvester.support.ReportStatistics completed INFO: Harvesting of PROCESS:: status: completed, title: UNC [\\GIS] --> [GPT [http://cal8783:8088/geoportal/]] completed at Wed Nov 09 17:17:46 MST 2016. No. succeded: 0, no. failed: 0

NOAA ERDDAP Waf issues

dot dot causing issues (insert Mama Mia joke here)

https://coastwatch.pfeg.noaa.gov/erddap/metadata/iso19115/xml/

Exception in thread "HARVESTING" java.lang.IllegalArgumentException: Illegal character in fragment at index 69: https://coastwatch.pfeg.noaa.gov/erddap/metadata/iso19115/xml/..
at java.net.URI.create(URI.java:852)

tasks panel Sort

Minor improvement to the UI

Order by source
Order by output

Source Name

Source name in
com.esri.geoportal.commons.gpt.client.PublishRequest

public final class PublishRequest {
public String src_source_type_s;
public String src_source_uri_s;
public String src_uri_s;
public String src_lastupdate_dt;
public String xml;

Harvest failing-- 'no content to map' at JSON mapping

I'm trying to harvest ISO xml from a web accessible folder, with gpt 2 .? and harvester 2.5. The xml files from the WAF seem to be getting to the harvester, but there's some problem when it gets to a JSON mapping step; Cause is "No content to map due to end-of-input". The log files aren't providing much to go on. I set logging.properties to 'ALL' for the harvester, still only getting errors and Info.

Help!

log extracts from hrv log
17-Apr-2017 16:52:18.260 INFO [https-jsse-nio-8443-exec-4] com.esri.geoportal.harvester.engine.defaults.DefaultProcessor.createProcess SUBMITTING: PROCESSOR: DEFAULT[], SOURCE: WAF[waf-host-url=http://get.iedadata.org/metadata/iso/usap/, waf-pattern=, cred-username=, cred-password=], DESTINATIONS: [GPT[gpt-host-url=http://localhost:8080/geoportal, cred-username=gptadmin, cred-password=, gpt-index=, gpt-cleanup=false]], INCREMENTAL: false, IGNOREROBOTSTXT: true

then... (some entries omitted)

com.esri.geoportal.harvester.api.ex.DataOutputException: Error publishing data: id: http://get.iedadata.org/metadata/iso/usap/600025iso.xml, modified: Wed Apr 05 15:43:19 EDT 2017, source URI: http://get.iedadata.org/metadata/iso/usap/600025iso.xml, broker URI: WAF:http://get.iedadata.org/metadata/iso/usap/
at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:164)
at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64)
.... skip some 'at's
at java.lang.Thread.run(Thread.java:745)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]
at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:216)
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3833)

output to folder cleanup too agressive

The output to folder puts items in folders
so if I say /opt/tomcat/webapps/metadata

for a run for against an service an items put into
/opt/tomcat/webapps/metadata/opentopo.sdsc.edu/
which is the base name of the host.

If I set perform cleanup, it zaps all items in metadata, and not just the host folder.

Option 1: just clean up folder when items will be placed
option 2: use input broker name

Implement DCAT output

allow anonymous harvesting of ArcGIS Online/Portal

currently the ArcGIS Portal input broker requires providing an identity. The harvest only retrieves documents owned by this identity.

desired state is that without providing an identity, the harvester connects anonymously and only retrieves publicly shared items from the indicated portal/online subscription.

How to harvest a Entreprise Geodatabase

Hi
In order to harvest an Entreprise Geodatabase, should I export all the metada into XML metadata files ?
Is It possible to create a connector ?

Best regards

GeoPDF coordinates are reversed

From #61: The coordinates for the bounding box in the generated metadata should be longitude then latitude (e.g. 97 34), according to the schema.

The current harvester code is printing them latitude then longitude (e.g. 34 97).

OAI PMH do not send empty set

fails
https://mercury.ornl.gov/oai/provider?verb=ListIdentifiers&metadataPrefix=fgdc&set=
works
https://mercury.ornl.gov/oai/provider?verb=ListIdentifiers&metadataPrefix=fgdc
if set is empty, do not send.

Automatically harvest upon detected changes or new XML

Automating the process of harvesting would be great. Would it be possible to connect harvester to a script that detects and changes in the folder?

For example if a new XML file is added to a harvested folder it would automatically detect it and begin harvesting.

Set Contributor

Allow for the setting of a contributor in the harvest source

Utilizing Patterns

What language is used when utilizing patterns? For example I want to harvest a folder but I don't want it to harvest from a specific subfolder called "Archive" what pattern would be required?

This would be useful info for a future Wiki.

Improving CSW harvest

USGS sciencebase is a large collection.
Tried twice. Crashed at 32k and 245k records of 6000k.
Need a new techniques with large collections.

ways to pass in a custom filter parameter, they have "collections" which can be used to get smaller sets
resumable/restartable at a record count

https://my.usgs.gov/confluence/display/sciencebase/Catalog+Services

moving issue from catalog to here:
Esri/geoportal-server-catalog#67

OAI and DataOne sources

Ability to harvest from OAI and DataOne endpoints preliminary implemented.
https://github.com/CINERGI/geoportal-server-harvester/tree/dataone

Will require changes to the catalog server to accept these data formats.

enhance harvesting of data.gov through CKAN to include full metadata

currently harvesting of datagov content through CKAN only return a Dublin Core document, full metadata can be extracted through the following parameters:

the datagov ckan api returns harvest_object_id in the extras field, using that value you can get the xml at
/harvest/object/[harvest_object_id]

To tell whether a dataset is harvested from a XML source or datajson source, you can look into the extra field and look for key 'source_datajson_identifier'. If the value is 'true', then the source is from datasjon and the harvest object metadata will be in json format.

examples:
https://catalog.data.gov/api/3/action/package_show?id=u-s-hourly-precipitation-data
https://catalog.data.gov/harvest/object/ac0da4ab-0b88-48c4-af2b-00611df0d956

https://catalog.data.gov/api/3/action/package_show?id=demographic-statistics-by-zip-code-acfc9
https://catalog.data.gov/harvest/object/810835b2-f684-495a-8c0e-d24b15bd2154

Geoportal looks for index [metadata_v1] but harvester indexes to [metadata] index

I now understand why I was having great difficulties at Esri/geoportal-server-catalog#42

I deleted my index.
I had then upgraded to the latest Harvester.
Tried to harvest again but only received errors in Geoportal.
Upgraded to ES 5+ and Geoportal 2.5
Got the same errors

Then brought back my old index through a backup and the previous Geoportal worked again.

Basically it seems the new Harvester saves to the ES [metadata] index instead of [metadata_v1]

This took an entire day to figure out as I couldn't see anything about this in the logs.

Provide a compiled war file

Hi,

You have a downloadable .war file for the catalog app, can you please provide one for the harvester for those of us unable to compile our own.

Thanks,
Marc

Enable query on harvest source

It would be useful to allow specification of a CQL or OGC Filter Query to filter a subset of records for harvesting from a CSW broker. For instance in my case I'd like to harvest only records from the GCMD catalog at https://cmr.earthdata.nasa.gov/csw/collections with a filter &CONSTRAINTLANGUAGE=CQL_TEXT&constraint=AnyText=NSF?PLR

Clarification of installation instructions

Can you please clarify the installation instructions?

After downloading the zip file and unzipping it to C:\temp\geoportal-server-harvester-master I try:

cd c:\temp\geoportal-server-harvester-master
mvn clean install

This runs for 5 seconds then generates the error message:

[ERROR] Failed to execute goal on project geoportal-harvester-csw: Could not res
olve dependencies for project com.esri.geoportal:geoportal-harvester-csw:jar:1.0
.0-SNAPSHOT: Could not find artifact com.esri.geoportal:geoportal-commons-http:j
ar:1.0.0-SNAPSHOT -> [Help 1]

The full error log is here:
install_errors.txt

Am I doing something wrong, or is there a different problem? (This is my first time using MVN).

I'm using Apache Tomcat 8.0.35 and jdk1.8.0_92 on Windows 7. Thanks

Add Task adds multiple outputs

If you add a task, it adds multiple outputs for the second, and third times

Slow harvest from CSW

https://data.ioos.us/csw

6 records a minute.
Returning 10 records at a time comes back in about 3 seconds (start 1, and 1001)

Saw this before with
http://search.geothermaldata.org/csw

and attributed it to some server/pyCSW issue.

Now it's feeling like it might be something on the harvester side.

IOOS had an issue, and has now fixed it. (turned out to be a python 3 string issue)

scheduler not working

I'm trying to set a schedule for harvest jobs, using harvester release 2.5.1. I've set up several tasks, and I want to schedule them. I'm working with a new VM, CentOS 7.3., geoportal 2.5.0 runnign on same machine. SELunix is set to permissive. Harvesting works when I run the tasks (the input brokers are WAF, Output is localhost geoportal).
When I click on 'schedule' a little modal dialog opens with titled 'Task', with a 'Type' combo box and a 'Submit' button. The menu in the 'Type' combo box is empty. Clicking submit does nothing.
In the hrvyyy-mm-dd.log file, I see this message that appears to be related to trying to schedule:

... SEVERE [http-nio-8080-exec-1] org.apache.catalina.core.StandardWrapperValve.invoke Servlet.service() for servlet [spring] in context with path [/harvester] threw exception [Request processing failed; nested exception is java.util.MissingResourceException: Can't find bundle for base name EngineResource, locale en_US] with root cause
...
at com.esri.geoportal.harvester.engine.triggers.AtTrigger.getTemplate(AtTrigger.java:95

)

Debugging Failed Harvest Entries

What is the best way to debug the cause of failed harvest entries?
After evaluating the hrv.log I get a count of the No. succeeded and the no. Failed.

Failed entries look like this:

Nov 03, 2016 3:52:41 PM com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess lambda$null$0
WARNING: Failed harvesting REF \\UNCPATH.xml | Mon Aug 15 09:15:28 MDT 2016 | \\UNCPATH.xml during UNC [\\UNCPATH] --> [GPT [http://server:8088/geoportal/]]
Nov 03, 2016 3:52:41 PM com.esri.geoportal.harvester.support.ReportLogger error
SEVERE: Error processing task: PROCESS:: status: working, title: UNC [\\UNCPATH] --> [GPT [http://server:8088/geoportal/]]
com.esri.geoportal.harvester.api.ex.DataOutputException: Error publishing data.
	at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:141)
	at com.esri.geoportal.harvester.api.base.BrokerLinkActionAdaptor.push(BrokerLinkActionAdaptor.java:64)
	at com.esri.geoportal.harvester.api.base.SimpleLink.push(SimpleLink.java:71)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$null$0(DefaultProcessor.java:135)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
	at com.esri.geoportal.harvester.engine.defaults.DefaultProcessor$DefaultProcess.lambda$new$1(DefaultProcessor.java:133)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.client.HttpResponseException: Bad Request
	at com.esri.geoportal.commons.gpt.client.Client.publish(Client.java:131)
	at com.esri.geoportal.harvester.gpt.GptBroker.publish(GptBroker.java:131)
	... 7 more

inside the geoportal.log file I see the entries:

2016-11-03 15:52:34,293 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,316 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,332 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,349 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,471 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,877 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,913 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:34,939 DEBUG [com.esri.geoportal.context.AppResponse] - Unrecognized metadata type.
2016-11-03 15:52:35,091 DEBUG [com.esri.geoportal.base.util.DateUtil] - Bad ISO date: REQUIRED: The year (and optionally month, or month and day) for which the data set corresponds to the ground.
2016-11-03 15:52:35,093 DEBUG [com.esri.geoportal.base.util.DateUtil] - Bad ISO date: REQUIRED: The year (and optionally month, or month and day) for which the data set corresponds to the ground.
2016-11-03 15:52:35,527 DEBUG [com.esri.geoportal.context.AppResponse] - Validation exception.

Basically tons of Unrecognized metadata type and Validation exception errors. For Validation errors, how would I know what is failing? The Bad ISO date error is great as it clearly tells me whats going wrong.
Thanks!