Currently, if no valid proxy is found within 5 tries on any exporter, the DAG fails. The usual cause is a proxy health issue (persisting even after 5 retries), as in the logs below:
*** Reading local file: /usr/local/airflow/logs/rss_news_dag/exporting_101greatgoals_news_to_broker/2020-10-11T08:10:00+00:00/1.log
[2020-10-11 08:21:15,995] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [queued]>
[2020-10-11 08:21:16,029] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [queued]>
[2020-10-11 08:21:16,029] {{taskinstance.py:866}} INFO -
--------------------------------------------------------------------------------
[2020-10-11 08:21:16,029] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2020-10-11 08:21:16,029] {{taskinstance.py:868}} INFO -
--------------------------------------------------------------------------------
[2020-10-11 08:21:16,053] {{taskinstance.py:887}} INFO - Executing <Task(PythonOperator): exporting_101greatgoals_news_to_broker> on 2020-10-11T08:10:00+00:00
[2020-10-11 08:21:16,055] {{standard_task_runner.py:53}} INFO - Started process 2164 to run task
[2020-10-11 08:21:16,162] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: rss_news_dag.exporting_101greatgoals_news_to_broker 2020-10-11T08:10:00+00:00 [running]> cfc5513180c6
[2020-10-11 08:21:16,195] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,195] {{retry_on_exception.py:14}} INFO - Retries: 5
[2020-10-11 08:21:16,201] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,200] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:16,201] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,201] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:16,202] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,202] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:16,307] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,307] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:16,307] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,307] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:16,366] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:16,366] {{main.py:20}} INFO - {'http': 'http://181.129.70.82:46752', 'https': 'http://181.129.70.82:46752'}
[2020-10-11 08:21:46,395] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,394] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7373c37190>, 'Connection to 181.129.70.82 timed out. (connect timeout=30)'))
[2020-10-11 08:21:46,395] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,395] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:46,396] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,396] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection.
[2020-10-11 08:21:46,397] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,396] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:46,397] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,397] {{retry_on_exception.py:29}} INFO - Retries: 4
[2020-10-11 08:21:46,399] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,399] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:46,400] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,399] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:46,400] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,400] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:46,505] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,505] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:46,505] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,505] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:46,513] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,513] {{main.py:20}} INFO - {'http': 'http://185.74.4.47:8080', 'https': 'http://185.74.4.47:8080'}
[2020-10-11 08:21:46,743] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,743] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
[2020-10-11 08:21:46,744] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,743] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:46,744] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,744] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection.
[2020-10-11 08:21:46,745] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,745] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:46,745] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,745] {{retry_on_exception.py:29}} INFO - Retries: 3
[2020-10-11 08:21:46,748] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,748] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:46,748] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,748] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:46,749] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,749] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:46,854] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,854] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:46,854] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,854] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:46,856] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,856] {{kafka.py:461}} INFO - Kafka producer closed
[2020-10-11 08:21:46,859] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:46,858] {{main.py:20}} INFO - {'http': 'http://165.22.36.75:8888', 'https': 'http://165.22.36.75:8888'}
[2020-10-11 08:21:47,811] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,811] {{web_parser.py:32}} INFO - Bad response
[2020-10-11 08:21:47,812] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,812] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:21:47,812] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,812] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection.
[2020-10-11 08:21:47,813] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,813] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:21:47,813] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,813] {{retry_on_exception.py:29}} INFO - Retries: 2
[2020-10-11 08:21:47,816] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,816] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:21:47,817] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,816] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:21:47,817] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,817] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:21:47,923] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,923] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:21:47,923] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,923] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:21:47,927] {{logging_mixin.py:112}} INFO - [2020-10-11 08:21:47,927] {{main.py:20}} INFO - {'http': 'http://139.5.71.199:8080', 'https': 'http://139.5.71.199:8080'}
[2020-10-11 08:22:17,949] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,949] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7373a8ae50>, 'Connection to 139.5.71.199 timed out. (connect timeout=30)'))
[2020-10-11 08:22:17,950] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,950] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:22:17,951] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,950] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection.
[2020-10-11 08:22:17,951] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,951] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:22:17,952] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,951] {{retry_on_exception.py:29}} INFO - Retries: 1
[2020-10-11 08:22:17,954] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,954] {{conn.py:378}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: connecting to kafka:9092 [('172.19.0.10', 9092) IPv4]
[2020-10-11 08:22:17,955] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,955] {{conn.py:1195}} INFO - Probing node bootstrap-0 broker version
[2020-10-11 08:22:17,956] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:17,955] {{conn.py:407}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connecting> [IPv4 ('172.19.0.10', 9092)]>: Connection complete.
[2020-10-11 08:22:18,061] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,060] {{conn.py:1257}} INFO - Broker version identified as 1.0.0
[2020-10-11 08:22:18,061] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,061] {{conn.py:1259}} INFO - Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
[2020-10-11 08:22:18,065] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,065] {{main.py:20}} INFO - {'http': 'http://185.74.4.47:8080', 'https': 'http://185.74.4.47:8080'}
[2020-10-11 08:22:18,302] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,302] {{web_parser.py:34}} INFO - Error occurred: HTTPSConnectionPool(host='www.101greatgoals.com', port=443): Max retries exceeded with url: /feed/ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
[2020-10-11 08:22:18,302] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,302] {{kafka.py:471}} INFO - Closing the Kafka producer with 9223372036.0 secs timeout.
[2020-10-11 08:22:18,303] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,303] {{conn.py:916}} INFO - <BrokerConnection node_id=bootstrap-0 host=kafka:9092 <connected> [IPv4 ('172.19.0.10', 9092)]>: Closing connection.
[2020-10-11 08:22:18,304] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:18,304] {{retry_on_exception.py:20}} INFO - Error occured: Not a valid XML document
[2020-10-11 08:22:18,304] {{taskinstance.py:1128}} ERROR - Not a valid XML document
Traceback (most recent call last):
File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1637, in close
self.parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/utils.py", line 33, in parse_xml
return defused_xml_parse(xml_content)
File "/usr/local/lib/python3.7/site-packages/defusedxml/common.py", line 105, in parse
return _parse(source, parser)
File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 605, in parse
self._root = parser.close()
File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1639, in close
self._raiseerror(v)
File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1531, in _raiseerror
raise err
File "<string>", line None
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 22, in wrapper
self._raise_on_condition(self._retries, err)
File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 27, in _raise_on_condition
raise exception
File "/usr/local/airflow/modules/retry/retry_on_exception.py", line 17, in wrapper
return function(*args, **kwargs)
File "/usr/local/airflow/modules/rss_news/main.py", line 21, in export_news_to_broker
for news in NewsProducer(rss_feed).get_news_stream(proxy):
File "/usr/local/airflow/modules/rss_news/rss_news_producer.py", line 34, in get_news_stream
news_feed_items = self._extract_news_feed_items(proxies)
File "/usr/local/airflow/modules/rss_news/rss_news_producer.py", line 30, in _extract_news_feed_items
news_feed = atoma.parse_rss_bytes(content)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/rss.py", line 217, in parse_rss_bytes
root = parse_xml(BytesIO(data)).getroot()
File "/usr/local/airflow/.local/lib/python3.7/site-packages/atoma/utils.py", line 35, in parse_xml
raise FeedXMLError('Not a valid XML document')
atoma.exceptions.FeedXMLError: Not a valid XML document
[2020-10-11 08:22:18,307] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=rss_news_dag, task_id=exporting_101greatgoals_news_to_broker, execution_date=20201011T081000, start_date=20201011T082115, end_date=20201011T082218
[2020-10-11 08:22:21,351] {{logging_mixin.py:112}} INFO - [2020-10-11 08:22:21,351] {{local_task_job.py:103}} INFO - Task exited with return code 1
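
In the logs above, every attempt fails at the proxy (connect timeout, connection reset, or a bad response), yet the task ends with `FeedXMLError: Not a valid XML document`, apparently because an empty body still reaches the RSS parser. Below is a minimal sketch of one way to surface the real cause, assuming the fetch step can be passed in as a callable. `fetch_with_proxy_rotation`, `NoHealthyProxyError`, and the `fetch` parameter are hypothetical names for illustration, not part of the project's code.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


class NoHealthyProxyError(Exception):
    """Hypothetical error: every proxy attempt failed or returned nothing."""


def fetch_with_proxy_rotation(
    fetch: Callable[[dict], Optional[bytes]],
    proxies: Iterable[dict],
    max_tries: int = 5,
) -> bytes:
    """Try at most max_tries proxies and return the first non-empty body.

    `fetch` is an assumed callable that takes a proxy mapping such as
    {'http': ..., 'https': ...} and returns the response body, or raises.
    """
    errors = []
    for proxy in islice(proxies, max_tries):
        try:
            body = fetch(proxy)
        except Exception as err:  # e.g. connect timeout, connection reset
            errors.append((proxy, repr(err)))
            continue
        if body:
            # Only a non-empty body is worth handing to the XML parser.
            return body
        errors.append((proxy, "empty response"))
    # Report the proxy failures instead of a misleading XML parse error.
    raise NoHealthyProxyError(
        f"no healthy proxy in {len(errors)} tries: {errors}"
    )
```

With something like this, the task would fail with a message listing each proxy and its error, and the parser would only ever be called on a non-empty response.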