This article covers everything you need to know about parsing prices in Python. By the end of this article, you will be able to extract price and currency from raw text, which is usually collected from external data sources like web pages or files.
This article covers a specific case - converting unprocessed text containing price and any other money-like data into price and currency. This is a common problem that needs to be addressed in cases of web scraping, data cleaning, and similar projects where the currency needs some cleaning.
The currency data can be in various formats, varying, especially with currency code in the locale. Here are few examples:
'29,99 €' # Comma as decimal separator
'59,90 EUR'
'$ 9.99' # Period as decimal separator
'9.99 USD'
'Price: $12.99'
'¥ 170 000'
'12,000원'
'₹898.00'
'Price From £ 39.95'
As it is evident that every format needs a different way of extracting the currency and the price value.
This article will first examine how it can be done using Python Standard Library. In the second part, a specialized library for this purpose called Price Parser will be examined.
The best way to extract numeric data from any string is to use regular expressions. Regular expression operations can be performed using Python's re
module.
This is going to be a two-part process. The first part is building a regular expression, and the second part is using this expression.
Regular Expressions can be difficult for beginners. Websites like Regexr can help build and test regular expressions.
Let's start with an example.
- The digits can be matched with regex expression
\d
. - The price can also have a comma and a period. These can be selected using
.
and,
. - The price can also contain space. The whitespace characters can be selected using
\s
. - All above need to be put in a set using
[]
. - There can be zero or more instances of the above. We need to use a wild card
*
for this. - If we put together all these in a set, the resulting regular expression will be like this:
expression = '[\d\s.,]*'
This expression will match the space outside the currency as well. This can be fixed easily by appending \d
to ensure that even if space is matched, it ends with a number.
expression = '[\d\s.,]*\d'
In this statement, all the characters from the set contained in the squared brackets will be matched, and an asterisk will ensure that 0 or more of these can be matched.
If you are building this regular expression at Regexr, you can paste in all the strings and see what is getting matched.
Now we can put together the Python code required to extract the price.
import re
expression = '[\d\s.,]*\d'
price = '9.99 USD'
match = re.search(expression,price)
if match:
print(match.group(0)) # Prints 9.99
This regular expression can extract prices from most of the cases.
There are some problems, though. The first problems are the decimal separator and the thousand separators. In some locales, a comma is used for decimal separators, and a period or space is used for thousand separators.
For example, ten thousand euros and forty-five cents are written in these two possible ways.
10 000,45
10.000,45
On the other hand, other locales use a comma as the decimal separator and period as the thousand separators. In that case, the same price would be written like this:
10,000.45
To handle this, there will be a need to write more code.
So far, we did not talk about getting the currency and cleaning up the price.
As we add more cases, the code will become more and more complex.
This is where it is important to know that there is already a package published on the Python Package Index —Price Parser.
Price Parser is a package that aims to make parsing extracting price and currency from raw string much easier, without going into the complexities to handling all the cases.
Even though Price Parser is written for parsing scraped price string, it works with files or any other data source that contain currency information in raw, unprocessed strings.
Price Parser package can be installed easily from the terminal.
pip install date-parser
If you are using Anaconda, note that it is NOT available on any channel, and thus you will have to use pip install
command there as well.
conda install date-parser # DOES NOT WORK. Use pip install instead.
pip install date-parser # works
Before we explore how this package can be used, first, we must understand the class Price
of this package..
Price Parser package uses a special class to represent the currency. This class is aptly named Price
. The Price
class has two important attributes:
amount
: Price value asDecimal
currency
: Currency asstring
It also has a data descriptor amount_float
, that will return the currency amount as float
, instead of Decimal()
.
With this in mind, let's see how we can parse price.
Price Parser supports two ways of parsing. These are not different ways but rather just alias. They do the same thing.
These two functions are:
price_parser.parse_price
Price.fromstring
The signature of these functions is as follows:
fromstring(price, currency_hint= None, decimal_separator= None)
The only mandatory parameter is the price string. The other two parameters are optional. This function will return an instance of Price
.
Here is the first example. In this example, it can be seen how the amount
, amount_float
, and currency
can be used.
>>> from price_parser import Price
>>> price = Price.fromstring('29,99 €')
>>> print(price)
Price(amount=Decimal('29.99'), currency='€')
>>> price.amount
Decimal('29.99')
>>> price.amount_float
29.99
>>> price.currency
'€'
For simple cases, supplying only the price string is enough for the method to work. Here are few more examples:
>>> Price.fromstring('59,90 EUR')
Price(amount=Decimal('59.90'), currency='EUR')
>>> Price.fromstring('12,000원')
Price(amount=Decimal('12000'), currency='원')
The fromstring
function can handle uncleaned strings and extract price amount and currency for most of the cases.
>>> Price.fromstring('Price From £ 39.95')
Price(amount=Decimal('39.95'), currency='£')
Often, € sign is used as the decimal separator. This is handled by Price Parser without any additional tweaking.
>>> Price.fromstring('29€ 99 ')
Price(amount=Decimal('29.99'), currency='€')
The previously discussed difficult cases, where space or a period can be used as thousands of separators are supported out of the box.
>>> Price.fromstring('10.000,45€')
Price(amount=Decimal('10000.45'), currency='€')
>>> Price.fromstring('10 000,45€')
Price(amount=Decimal('10000.45'), currency='€')
Both 0
and free
are interpreted as 0
. Further, if amount or currency cannot be determined by Price Parser, it will return a Price
object with amount
and/or currency
will be set to None
.
>>> Price.fromstring('free')
Price(amount=Decimal('0'), currency=None)
>>> Price.fromstring('$')
Price(amount=None, currency='$')
>>> Price.fromstring('0')
Price(amount=Decimal('0'), currency=None)
>>> Price.fromstring('not available')
Price(amount=None, currency=None)
Currency hint can be used if Price Parser is not able to detect the currency. For example, United Arab Emirates Dirham is written as both AED and Dhs. Price Parser can handle AED, but can not handle Dhs. In this case, currency_hint
is useful.
>>> Price.fromstring('29 AED')
Price(amount=Decimal('29'), currency='AED') # correct currency
>>> Price.fromstring('29 Dhs')
Price(amount=Decimal('29'), currency=None) # currency is None
>>> Price.fromstring('29 Dhs', currency_hint="AED") # send currency_hint
Price(amount=Decimal('29'), currency='AED') # currency is correct
Finally, if the decimal separator is known, it can be set using the decimal_separator
argument. This will ensure that there are no errors in guessing the decimal separator. Here is an example:
>>> Price.fromstring("Price: $140,600") # It's actually 140.600
Price(amount=Decimal('140600'), currency='$') # Wrong output
>>> Price.fromstring("Price: $140,600", decimal_separator=",")
Price(amount=Decimal('140.600'), currency='$') # Correct output